STATS 32 Session 4: Data Visualization (continued)
Kenneth Tay
Oct 3, 2019
Final project
Goal: Demonstrate that you know how to do data analysis in R
Can be done individually or in a pair.
Minimum requirements:
- 1 R Markdown file and 1 HTML file
- Use a dataset that we have not used in class
- “Introduction”, “Data analysis” and “Conclusion” sections
- At least 3 data visualizations, each of a different type (6 data visualizations with at least 3 different types if working with a teammate)
- Examples on the class website
- Due Nov 2 (Sat), 23:59:59
Project proposal
- 1-2 paragraphs long
- Details on the problem you wish to explore, datasets you will use, potential visualizations
- Due Oct 16 (Wed), 23:59:59
Recap of session 3
- Different types of plots
- 1 categorical variable: barplot
- 1 continuous variable: histogram
- Continuous vs. continuous: scatterplot
- Continuous vs. time: lineplot
- Continuous vs. categorical: boxplots & violin plots
- Categorical vs. categorical: heatmap
- Visualizing data with ggplot2
- Key elements: data, geometries, aesthetics
3 essential elements of graphics: data, geometries, aesthetics
Data: Dataset we are using for the plot
## mpg weight cylinders
## 1 21.0 2.620 6
## 2 21.0 2.875 6
## 3 22.8 2.320 4
## 4 21.4 3.215 6
## 5 18.7 3.440 8
## 6 18.1 3.460 6
## 7 14.3 3.570 8
## 8 24.4 3.190 4
## 9 22.8 3.150 4
## 10 19.2 3.440 6
Geometries: Visual elements used for our data
- E.g. point, line, histogram, bar, boxplot
Geom: point
Aesthetics: Defines the data columns which affect various aspects of the geom
- E.g. x, y, color, fill, size, alpha, line type, line width
- Which aesthetics you use depends on the geometries you choose
3 different aesthetics:
- x-axis: weight
- y-axis: mpg
- color: cylinders
- shape, size, etc. take on default values, not determined by data
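The three elements above can be sketched in code as follows. This is a minimal illustration, not necessarily the code behind the slide's plot: the data frame name cars_df is hypothetical, the 10 rows are typed in from the Data slide, and wrapping cylinders in factor() (to get one distinct color per cylinder count) is an assumption.

```r
library(ggplot2)

# The 10 rows shown on the Data slide, typed in by hand.
# The name cars_df is made up for this example.
cars_df <- data.frame(
  mpg       = c(21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2),
  weight    = c(2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.440),
  cylinders = c(6, 6, 4, 6, 8, 6, 8, 4, 4, 6)
)

# Data and aesthetics go into ggplot(); the geometry is added with +.
# factor() is an assumption: it treats cylinders as categories for coloring.
p <- ggplot(data = cars_df,
            mapping = aes(x = weight, y = mpg, color = factor(cylinders))) +
  geom_point()

# Typing p at the console (or calling print(p)) draws the plot.
```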
Examples of other aesthetics
- x-axis: weight
- y-axis: mpg
- size: cylinders
- alpha: weight
Examples of other aesthetics
- x-axis: weight
- y-axis: mpg
- color: cylinders
- shape: cylinders
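A rough sketch of the two "other aesthetics" examples, assuming the same 10 rows from the Data slide stored in a hypothetical data frame cars_df. The use of factor() for color and shape is an assumption; shape in particular needs a categorical variable.

```r
library(ggplot2)

# Same 10 rows as on the Data slide; cars_df is a made-up name.
cars_df <- data.frame(
  mpg       = c(21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2),
  weight    = c(2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.440),
  cylinders = c(6, 6, 4, 6, 8, 6, 8, 4, 4, 6)
)

# Example 1: point size follows cylinders, transparency follows weight.
p_size <- ggplot(cars_df,
                 aes(x = weight, y = mpg, size = cylinders, alpha = weight)) +
  geom_point()

# Example 2: color and shape both follow cylinders.
# shape cannot be mapped to a continuous variable, hence factor().
p_shape <- ggplot(cars_df,
                  aes(x = weight, y = mpg,
                      color = factor(cylinders), shape = factor(cylinders))) +
  geom_point()
```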
Agenda for today
- Layers
- Scales
- Facets
- Themes (“non-data ink”)
Layers: Combining multiple plots into one graphic
We can have more than one layer in a graphic.
Each layer contains (essentially):
- 1 dataset, 1 geometric object, aesthetic mappings
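As an illustration of a two-layer graphic, the sketch below spells out the dataset and aesthetic mappings separately for each layer. The choice of a point layer plus a fitted straight line is an assumption made for the example (the slide's own figure may use different geoms), and cars_df is a hypothetical name for the 10 rows from the Data slide.

```r
library(ggplot2)

# Same 10 rows as on the Data slide; cars_df is a made-up name.
cars_df <- data.frame(
  mpg       = c(21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2),
  weight    = c(2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.440),
  cylinders = c(6, 6, 4, 6, 8, 6, 8, 4, 4, 6)
)

# Each layer carries its own dataset, geom and aesthetic mappings.
p_layers <- ggplot() +
  # Layer 1: the raw data as points
  geom_point(data = cars_df, mapping = aes(x = weight, y = mpg)) +
  # Layer 2: a straight-line fit through the same data
  geom_smooth(data = cars_df, mapping = aes(x = weight, y = mpg),
              method = "lm", formula = y ~ x)
```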
When layers share attributes, we only have to type them once:
- We can drop data = if it is the first argument of ggplot()
- We can drop mapping = if:
- it is the second argument of ggplot()
- it is the first argument of a geom_xx() function
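A short sketch of these shortcuts, assuming a hypothetical data frame cars_df holding the 10 rows from the Data slide. The shared dataset and mappings are typed once in ggplot(), and both data = and mapping = are dropped because the arguments are passed by position.

```r
library(ggplot2)

# Same 10 rows as on the Data slide; cars_df is a made-up name.
cars_df <- data.frame(
  mpg       = c(21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2),
  weight    = c(2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.440),
  cylinders = c(6, 6, 4, 6, 8, 6, 8, 4, 4, 6)
)

# Shared data and mappings live in ggplot(); both layers inherit them.
p_shared <- ggplot(cars_df, aes(x = weight, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

# A layer can still add a mapping of its own: aes() is the first
# argument of a geom, so mapping = can be dropped there too.
p_mixed <- ggplot(cars_df, aes(x = weight, y = mpg)) +
  geom_point(aes(color = factor(cylinders))) +
  geom_smooth(method = "lm", formula = y ~ x)
```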
Scales
- Aesthetics only tell you which column corresponds to which aesthetic (e.g. cylinders -> color, mpg -> y)
- They do not tell you how to do the mapping
- E.g. Which color should represent which cylinder value?
- E.g. What should the range of the x-axis be?
- Scales define that for you
- (On the bright side, defaults usually ok)
Scales example: colors
Default colors
Manually chosen colors
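One way to choose colors manually is scale_color_manual(). The sketch below assumes a hypothetical data frame cars_df with the 10 rows from the Data slide; the particular colors are arbitrary picks for illustration, not the ones used on the slide.

```r
library(ggplot2)

# Same 10 rows as on the Data slide; cars_df is a made-up name.
cars_df <- data.frame(
  mpg       = c(21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2),
  weight    = c(2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.440),
  cylinders = c(6, 6, 4, 6, 8, 6, 8, 4, 4, 6)
)

# Default colors: ggplot2 picks the scale for us.
p_default <- ggplot(cars_df, aes(weight, mpg, color = factor(cylinders))) +
  geom_point()

# Manual colors: the named vector says which color goes with which value.
p_manual <- p_default +
  scale_color_manual(values = c("4" = "red", "6" = "blue", "8" = "darkgreen"))
```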
Scales example: x- & y-axes
Default axis limits
Manually chosen axis limits
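Axis limits can be set through the x and y scales. This is a sketch under the assumption of a hypothetical data frame cars_df (the 10 rows from the Data slide); the limits themselves are arbitrary values chosen to contain all the data.

```r
library(ggplot2)

# Same 10 rows as on the Data slide; cars_df is a made-up name.
cars_df <- data.frame(
  mpg       = c(21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2),
  weight    = c(2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.440),
  cylinders = c(6, 6, 4, 6, 8, 6, 8, 4, 4, 6)
)

# Default limits: the axes hug the range of the data.
p_default <- ggplot(cars_df, aes(weight, mpg)) +
  geom_point()

# Manual limits: both axes now start at 0.
p_limits <- p_default +
  scale_x_continuous(limits = c(0, 6)) +
  scale_y_continuous(limits = c(0, 30))
```

Points that fall outside scale limits are dropped with a warning, so limits should be chosen wide enough to cover the data.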
Facets
- Plotting different parts of our data on different canvases
- Can give a clearer, less cluttered picture
- We can facet by rows and/or columns
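Faceting can be sketched with facet_grid(). The example assumes a hypothetical data frame cars_df with the 10 rows from the Data slide and facets on cylinders, which is just one possible choice of faceting variable.

```r
library(ggplot2)

# Same 10 rows as on the Data slide; cars_df is a made-up name.
cars_df <- data.frame(
  mpg       = c(21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2),
  weight    = c(2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.440),
  cylinders = c(6, 6, 4, 6, 8, 6, 8, 4, 4, 6)
)

base_plot <- ggplot(cars_df, aes(weight, mpg)) +
  geom_point()

# One panel per cylinder count, stacked as rows ...
p_rows <- base_plot + facet_grid(rows = vars(cylinders))

# ... or laid out side by side as columns.
p_cols <- base_plot + facet_grid(cols = vars(cylinders))
```

With a second categorical column, rows = and cols = can be given together to get a grid of panels.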
Themes
Refers to all non-data ink
- Titles, axis ticks & labels, background color, legend, etc.
- Can manually set each item, or use preset themes
ggplot2’s default theme
Minimal theme
More pre-set themes
Classic theme
Dark theme
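The preset themes named above are applied by adding a theme function to the plot. A minimal sketch, assuming a hypothetical data frame cars_df with the 10 rows from the Data slide; the title and axis labels are made up for the example.

```r
library(ggplot2)

# Same 10 rows as on the Data slide; cars_df is a made-up name.
cars_df <- data.frame(
  mpg       = c(21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2),
  weight    = c(2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.440),
  cylinders = c(6, 6, 4, 6, 8, 6, 8, 4, 4, 6)
)

# Titles and axis labels are non-data ink too; labs() sets them.
base_plot <- ggplot(cars_df, aes(weight, mpg)) +
  geom_point() +
  labs(title = "Fuel efficiency vs. weight", x = "Weight", y = "Miles per gallon")

p_default <- base_plot                    # ggplot2's default theme (theme_gray)
p_minimal <- base_plot + theme_minimal()  # minimal theme
p_classic <- base_plot + theme_classic()  # classic theme
p_dark    <- base_plot + theme_dark()     # dark theme
```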
We’ve only scratched the surface!
R Graph Gallery: an excellent source of inspiration and code snippet examples
Today’s dataset: Diamonds
What makes an expensive diamond?
Full specification of a graphic
One graphic contains:
- 1 or more layers
- Each layer has 1 dataset, 1 geometric object, aesthetic mappings, 1 statistic (default usually ok), 1 position (default usually ok)
- 1 scale for each aesthetic mapping (defaults usually ok)
- 1 coordinate system (default usually ok)
- facet specification (if any)
Other grammatical elements: statistics
Behind the scenes, R may need to do some transformation on the dataset to make the graphic.
- Each geometry has a default statistic, usually good enough
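A bar plot is a convenient illustration: geom_bar() defaults to a "count" statistic, so ggplot2 tallies the rows per category before drawing. The sketch assumes a hypothetical data frame cars_df with the 10 rows from the Data slide; the pre-counted table counts_df is also made up for the example.

```r
library(ggplot2)

# Same 10 rows as on the Data slide; cars_df is a made-up name.
cars_df <- data.frame(
  mpg       = c(21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2),
  weight    = c(2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.440),
  cylinders = c(6, 6, 4, 6, 8, 6, 8, 4, 4, 6)
)

# geom_bar() counts the rows in each category behind the scenes.
p_bar <- ggplot(cars_df, aes(x = factor(cylinders))) +
  geom_bar()

# Equivalent plot from counts we computed ourselves:
# geom_col() draws the values as given, with no counting step.
counts_df <- as.data.frame(table(cylinders = cars_df$cylinders))
p_col <- ggplot(counts_df, aes(x = cylinders, y = Freq)) +
  geom_col()
```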
Other grammatical elements: position
Sometimes we need to tweak the position of the geometric elements because they obscure each other.
- E.g. jitter: randomly shifting points slightly
Only 9 data points??
Much better
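The effect of jittering can be reproduced with a toy dataset. Everything in this sketch is hypothetical: overlap_df is an invented data frame of 90 rows that fall on only 9 distinct positions, so it is not the dataset behind the slide's figures.

```r
library(ggplot2)

# Invented data: 90 rows, but only 9 distinct (x, y) combinations.
overlap_df <- data.frame(
  x = rep(1:3, times = 30),
  y = rep(1:3, each = 30)
)

# Without a position adjustment the points sit on top of each other,
# so the plot appears to contain just 9 points.
p_overlap <- ggplot(overlap_df, aes(x, y)) +
  geom_point()

# Jittering shifts every point by a small random amount,
# which reveals how many observations share each position.
set.seed(1)  # make the random shifts reproducible
p_jitter <- ggplot(overlap_df, aes(x, y)) +
  geom_point(position = "jitter")
```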
Shapes in R
Colors in R
- By name: e.g. “blue”, “red”, “black”, “white” (the full list is returned by colors())
- By RGB value: e.g. rgb(0,0,1), rgb(1,0,0), rgb(0,0,0), rgb(1,1,1)
- By hexadecimal value: e.g. “#0000FF”, “#FF0000”, “#000000”, “#FFFFFF”
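The three ways of specifying a color describe the same thing, which can be checked in base R. A small sketch; the variable names are made up for the example.

```r
# rgb() takes red, green and blue values between 0 and 1
# and returns the matching hexadecimal string.
blue_hex  <- rgb(0, 0, 1)   # "#0000FF"
red_hex   <- rgb(1, 0, 0)   # "#FF0000"
black_hex <- rgb(0, 0, 0)   # "#000000"
white_hex <- rgb(1, 1, 1)   # "#FFFFFF"

# col2rgb() goes the other way: from a name (or hex string)
# to red/green/blue values on a 0-255 scale.
blue_parts <- col2rgb("blue")

# colors() lists every color name R knows.
knows_blue <- "blue" %in% colors()
```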
Color scales in R